Point72 MI Data - Analytics Case Study

Introduction

The problem statement for this case study is shown below.

311 requests

First, we'll look at the 311 calls dataset. In its raw form, this dataset is quite large (7M rows and 43 columns). To avoid memory issues and keep runtimes reasonable, we can sample a chunk of the dataset instead of reading all of it. We can increase the sample size after the data exploration phase if the need arises, for example if an analysis we're interested in requires more data points.

To analyze the 311 Service Requests dataset, we will follow a structured approach to ingest, process, and aggregate the data, focusing on understanding the inbound call patterns. Here's an outline of the steps we'll take, the challenges we might encounter and how we'll overcome them, followed by an analysis and visualization of our findings. The steps described below are standard for data analysis, of the kind found in most reference material.

Step 1: Data Ingestion

Action: Load the dataset using Pandas.
Challenge: The file is very large and leads to memory issues.
Solution: Use chunking and randomly sample data points so that the sample spans all time periods.
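As a sketch, the chunked sampling in Step 1 might look like the following (the path, chunk size, and sampling fraction are illustrative placeholders, not values from the case study):

```python
import pandas as pd

def sample_csv(path, frac=0.05, chunksize=100_000, seed=42):
    """Read a large CSV in chunks, keeping a random fraction of each chunk
    so the sample spans the full time range of the file."""
    pieces = []
    for chunk in pd.read_csv(path, chunksize=chunksize):
        pieces.append(chunk.sample(frac=frac, random_state=seed))
    return pd.concat(pieces, ignore_index=True)
```

Because each chunk is sampled independently, the result covers the whole file even though only a fraction of it is held in memory at once.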

Step 2: Preliminary Exploration

Action: Perform an initial exploration to understand the dataset's structure, including the number of records, features, missing values, and data types.
Challenge: Identifying relevant features and handling missing or inconsistent data.
Solution: Use descriptive statistics and visualization tools to assess data quality, and apply imputation or removal of missing data as appropriate.
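A minimal exploration helper along these lines (the function name and summary layout are our own):

```python
import pandas as pd

def explore(df):
    """Summarize shape, dtypes, and missingness of a DataFrame."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
    })
    print(f"{len(df):,} rows x {df.shape[1]} columns")
    return summary
```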

Step 3: Data Cleaning

Action: Clean the dataset by handling missing values, correcting data types (e.g., converting timestamps to datetime objects), and removing duplicates.
Challenge: Ensuring accurate data type conversions and dealing with outliers.
Solution: Validate conversions through sample checks and use statistical methods to identify and handle outliers.
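A sketch of this cleaning step, assuming the 311 frame uses the 'Created Date' and 'Closed Date' columns referenced later in the analysis:

```python
import pandas as pd

def clean(df):
    """Drop exact duplicates, parse the timestamp columns, and drop rows
    whose created date could not be parsed."""
    df = df.drop_duplicates().copy()
    for col in ("Created Date", "Closed Date"):
        # errors="coerce" turns unparseable timestamps into NaT
        df[col] = pd.to_datetime(df[col], errors="coerce")
    return df.dropna(subset=["Created Date"])
```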

Step 4: Data Aggregation

Action: Aggregate the data to identify patterns, such as call volume over time, most common service requests, and geographic distribution of calls.
Challenge: Choosing the right level of aggregation to reveal meaningful patterns without oversimplification.
Solution: Experiment with different aggregation levels (e.g., daily, monthly, by neighborhood) and metrics (e.g., count, mean) to find the most insightful views.

Step 5: Analysis and Visualization

Action: Analyze aggregated data to uncover trends, seasonal patterns, and anomalies. Visualize findings using charts and maps.
Challenge: Making complex data understandable and visually engaging.
Solution: Use a combination of visualization techniques, such as time series plots, bar charts, heatmaps, and geographic maps.

Step 6: Implications and Insights

Action: Interpret patterns to infer implications for city management, resource allocation, and policy making.
Challenge: Translating data patterns into actionable insights.
Solution: Combine data analysis with domain knowledge to provide recommendations or insights.

Let's start by ingesting the dataset and performing a preliminary exploration to understand its structure. We'll look at the number of records and features, and get a sense of the data we're dealing with. Then we'll proceed with the subsequent steps based on our initial findings.

We'll jump right into it by importing libraries. Then we'll load and examine the data to understand its structure and contents. This examination will allow us to identify the best ways to aggregate the 311 records and describe the characteristics of the inbound call patterns. Let's start by loading the data and taking a look at the first few rows.

The dataset contains various columns, including identifiers, dates, agency details, complaint types, descriptors, location information, and more. Here's a brief overview based on the first few rows:

Exploratory data analysis

To describe the characteristics of the inbound call patterns, we can aggregate the data in several ways:

  1. Time Trends: Analyze complaint volume over time (daily, monthly, yearly).
  2. Complaint Type Distribution: Identify the most common types of complaints.
  3. Agency Response: Examine which agencies handle the most complaints.
  4. Location Analysis: Determine areas with high complaint volumes.
  5. Resolution Time: Calculate the time taken to close complaints.

For a comprehensive analysis, we'll start by exploring each of these aggregations.

Let's begin with the time trends to see how complaint volumes have changed over time. We'll look at the yearly and monthly complaint volumes.
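These yearly and monthly counts can be computed with simple groupbys on the parsed 'Created Date' column (a minimal sketch; in the notebook the results would then be passed to a bar-chart call):

```python
import pandas as pd

def complaint_volumes(df):
    """Count complaints per year and per calendar month (across all years)."""
    yearly = df.groupby(df["Created Date"].dt.year).size()
    monthly = df.groupby(df["Created Date"].dt.month).size()
    return yearly, monthly
```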

The visualizations provide insights into the inbound call patterns based on the 311 records:

  1. Yearly Complaint Volumes: The bar chart shows the number of complaints for each year represented in the dataset. The number of complaints appears to have increased over the three years.

  2. Monthly Complaint Volumes (Across All Years): The second bar chart aggregates complaint volumes by month, combining data from all years. Not really seeing any major trends here - this is somewhat surprising since I expected seasonal trends.

We can also do this weekly.

Complaint Type Distribution

We'll next look at the distribution of complaint types. This will help us understand the most common issues reported by residents. After that, we can explore the resolution times to gauge the efficiency of the agencies' responses.

The bar chart displays the top 20 complaint types by frequency. This visualization helps us identify the most common issues that residents report. As we can see, Noise - Residential and Heat/Hot Water are the most frequent types of requests.

Agency Name Distribution

We'll next look at the distribution of frequency of complaints by agency.

We see that the NYPD and the Department of Housing Preservation and Development receive the highest number of complaints.

Which borough has the most requests?

Given that the different boroughs have different populations, we can get the per capita complaint frequency by normalizing the above counts by population. The populations of the boroughs from a recent census are as follows:

  - Brooklyn: 2.6 million
  - Queens: 2.3 million
  - Manhattan: 1.6 million
  - The Bronx: 1.4 million
  - Staten Island: 0.5 million

Next, we'll calculate the per capita complaints/requests for each borough by dividing the total number of complaints/requests from each borough by its population.
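A sketch of the per capita calculation using the census figures above (the 'Borough' column name and its uppercase values are assumptions about the dataset):

```python
import pandas as pd

# Borough populations from the census figures above
POPULATION = {
    "BROOKLYN": 2_600_000, "QUEENS": 2_300_000, "MANHATTAN": 1_600_000,
    "BRONX": 1_400_000, "STATEN ISLAND": 500_000,
}

def per_capita_complaints(df):
    """Complaints per resident for each borough, highest first."""
    counts = df["Borough"].value_counts()
    return (counts / pd.Series(POPULATION)).sort_values(ascending=False)
```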

Geographical locations of requests

Since we have the latitude and longitude of each reported incident, we can plot the locations and their density on a map of New York (mostly because it's cool!).

Next, let's explore the resolution times for complaints to understand how quickly agencies respond to and resolve these issues. We'll calculate the resolution time as the difference between the 'Closed Date' and 'Created Date', and then analyze the average resolution time by complaint type. Since there are many different complaint types, we will look at the 20 with the fastest and the 20 with the slowest resolution times.

We're seeing some negative resolution times, which likely represent errors in the data (such as closed dates recorded before the created dates). We'll filter out these erroneous records and then re-analyze the average resolution times by complaint type, ensuring more accurate insights.
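A minimal sketch of the resolution-time computation and the negative-value filter, assuming parsed datetime 'Created Date' and 'Closed Date' columns:

```python
import pandas as pd

def resolution_times(df):
    """Add resolution time in hours, dropping negative values (data errors
    where the closed date precedes the created date)."""
    hours = (df["Closed Date"] - df["Created Date"]).dt.total_seconds() / 3600
    df = df.assign(resolution_hours=hours)
    return df[df["resolution_hours"] >= 0]
```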

Okay that looks much better. Although we are seeing really quick resolution times (a few seconds), this is likely because some of these were marked as resolved on the initial phone call itself.

Weather Data

Now we'll jump into the weather data and start exploring this dataset.

The weather dataset contains 160,775 entries, with 20 columns detailing weather observations from New York between 2010 and November 2018. Here's a brief overview of some key columns:

For basic exploratory data analysis with plots, we could look into the following:

The plot illustrates the yearly average temperatures (mean, minimum, and maximum) in New York from 2010 through 2018. Each line represents the trend for mean, minimum, and maximum temperatures, providing insight into how the climate has varied over these years.

Looks like NY has received increased amounts of precipitation over the years.

The bar chart displays the yearly total precipitation in New York from 2010 through 2018. This visualization helps identify the years with higher or lower amounts of precipitation, indicating variability in rainfall and snowfall over the years.

The chart above illustrates the yearly wind patterns in New York from 2010 through 2018, showing both the average wind speed (in mph) and the maximum gusts recorded each year. The blue line represents the average wind speed, while the red line indicates the maximum gust speeds.

This visualization provides insight into the general wind conditions and highlights years with particularly strong gusts, reflecting the variability of wind intensity over the observed period.

Given the analyses conducted so far on the weather dataset, including temperature trends, precipitation patterns, and wind patterns, another insightful exploration could involve examining the relationship between weather conditions and specific weather events, such as days with significant precipitation or high wind speeds, and their occurrence over the years.

For this next step, let's focus on:

  1. Days with Significant Precipitation: Identify and visualize the number of days per year with precipitation exceeding a certain threshold, indicating heavy rainfall or significant snowfall events.
  2. Days with High Wind Speeds: Analyze and plot the number of days per year with wind speeds exceeding a threshold, highlighting windy conditions or storms.

Days with Significant Precipitation

Let's define "significant precipitation" as days with precipitation amounts exceeding 0.5 inches, which often indicates heavy rainfall or significant snowfall events. We'll count these days per year to see any trends in their frequency over time.
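Counting these days per year might look like the following (the 'Date' and 'Precip' column names are assumptions about the weather dataset):

```python
import pandas as pd

def heavy_precip_days(weather, threshold=0.5):
    """Count days per year with precipitation above `threshold` inches."""
    heavy = weather[weather["Precip"] > threshold]
    return heavy.groupby(heavy["Date"].dt.year).size()
```

The same pattern applies to the high-wind-speed analysis, with the precipitation column swapped for wind speed and a 20 mph threshold.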

The bar chart illustrates the number of days per year with significant precipitation (more than 0.5 inches) in New York from 2010 through 2018. This visualization helps identify years with higher frequencies of heavy rainfall or significant snowfall events, providing insight into variations in extreme weather conditions over the observed period.

Days with High Wind Speeds

Next, let's define "high wind speeds" as days with wind speeds exceeding 20 mph, which can indicate windy conditions or the presence of storms. We'll count these days per year to observe any trends in their occurrence over time.

Now that we've taken a quick look at patterns in the two datasets, we can combine them, examine correlations, and eventually build predictive models to forecast the number of requests in the next 7 days, as per the original problem description.

First we'll look at monthly trends and then jump into daily trends.

As a first pass, let's look at whether there's any correlation between precipitation and request count, and between wind speed and request count.
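As a sketch, these correlations can be computed from a frame with one row per day (the 'request_count' column name is our own):

```python
import pandas as pd

def weather_request_correlations(daily):
    """Pearson correlation of each numeric weather metric with the daily
    request count, sorted by absolute strength."""
    num = daily.select_dtypes("number")
    corr = num.corr()["request_count"].drop("request_count")
    return corr.sort_values(key=abs, ascending=False)
```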

We see a weak correlation between precipitation and request count, and almost no correlation between wind speed and request count.

Now, we see some requests with the word "heat" in them. We can check whether there is any correlation between heat-related requests and temperature.

Looks like the heat-related complaints are about problems with home heaters rather than hot weather, as these peak in the winter. Accordingly, we can see below that there is a strong negative correlation between the two sets of data.

Now we'll move to daily trends, look at broad correlations and then get into modeling.

We have included all the available numerical columns in the weather dataset above, and aggregated at the level of each day. Below, we can look at the correlations between these high-level metrics and the corresponding request counts.

In general, we are seeing low correlations across the board with request count. The most correlated features with request count are SnowDepth and Gust.

Building the predictive model

Below, we will start working towards the predictive model -

We now have the weather metrics forecasted for the next 7 days. We'll use this as input in the next few steps to build our predictive model.
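One way to sketch the model-building step, using a random forest (one of the three models mentioned later; the feature and column names here are illustrative, and scikit-learn is assumed to be available):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def fit_and_forecast(daily, forecast, features):
    """Fit a random forest on historical daily request counts, then predict
    the next 7 days from the forecasted weather metrics."""
    model = RandomForestRegressor(n_estimators=200, random_state=42)
    model.fit(daily[features], daily["request_count"])
    return pd.Series(model.predict(forecast[features]), index=forecast.index)
```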

Please note the following important considerations during the model-building process:

Now, we will visualize the predictions. Since the original dataset also has data for these dates (2018-11-13 through 2018-11-19), we can compare the predictions from the models to this ground truth.

We see that all 3 models performed reasonably well - with Random Forest performing the best. We aren't capturing the large spikes in the request counts (as seen on 2018-11-16), but otherwise, our predictions aren't too bad.

Summary and Conclusion

We analyzed the 311 requests data from NY and looked at various attributes and patterns in the data. While there are numerous types of complaints/requests in the dataset, a subset of these could be influenced by weather patterns. We then explored the weather dataset for the same time period and used weather features to predict the number of daily requests for the next 7 days, joining the two datasets and engineering multiple features from the available data to inform different regression models. The main takeaways are as follows: